Goto

Collaborating Authors

 teaching assistant


Human or AI? Comparing Design Thinking Assessments by Teaching Assistants and Bots

Khan, Sumbul, Liow, Wei Ting, Ang, Lay Kee

arXiv.org Artificial Intelligence

ORCID: 0000 -0003-2811-1194 Abstract --As design thinking education is growing in secondary and tertiary education, educators face a mounting challenge of evaluating creative artefacts that comprise visual and textual elements. Traditional, rubric-based methods of assessment are laborious, time-consuming, and inconsistent, due to their reliance on Teaching Assistants (TAs) in large, multi - section cohorts. This paper presents an exploratory study to investigate the reliability and perceived accuracy of AI -assisted assessment vis -à -vis TA-assisted assessment in evaluating student posters in design thinking education. Two activities were conducted with 33 Ministry of Education (MOE), Singapore school teachers, with the objective (1) to compare AI -generated scores with TA grading across three key dimensions: empathy and user understanding, identification of pain points and opportunities, and visual communication, and (2) to understand teacher preferences for AI-assigned, TA-assigned, and hybrid scores. Results showed low statistical agreement between instructor and AI scores for empathy and pain points, though slightly higher alignment for visual communication. Teachers generally preferred TA -assigned scores in six of ten samples. Qualitative feedback highlighted AI's potential for formative feedback, consistency, and student self -reflection, but raised concerns about its limitations in capturing contextual nuance and creative insight. The study underscores the need for hybrid assessment models that integrate computational efficiency with human insights . This research contributes to the evolving conversation around responsible AI adoption in creative disciplines, emphasizing the balance between automation and human judgment for scalable and pedagogically sound assessment practices. Design thinking is a human-centered approach to innovation that draws from the designer's toolkit to integrate the needs of people, the possibilities of technology, and the requirements for business success. It is a non - linear, iterative process that teams use to understand users, challenge assumptions, redefine problems, and create innovative solutions to prototype and test.


AI-Assisted Pleural Effusion Volume Estimation from Contrast-Enhanced CT Images

Basu, Sanhita, Fröding, Tomas, Kahraman, Ali Teymur, Toumpanakis, Dimitris, Sjöblom, Tobias

arXiv.org Artificial Intelligence

Background: Pleural Effusions (PE) is a common finding in many different clinical conditions, but accurately measuring their volume from CT scans is challenging. Purpose: To improve PE segmentation and quantification for enhanced clinical management, we have developed and trained a semi-supervised deep learning framework on contrast-enhanced CT volumes. Materials and Methods: This retrospective study collected CT Pulmonary Angiogram (CTPA) data from internal and external datasets. A subset of 100 cases was manually annotated for model training, while the remaining cases were used for testing and validation. A novel semi-supervised deep learning framework, Teacher-Teaching Assistant-Student (TTAS), was developed and used to enable efficient training in non-segmented examinations. Segmentation performance was compared to that of state-of-the-art models. Results: 100 patients (mean age, 72 years, 28 [standard deviation]; 55 men) were included in the study. The TTAS model demonstrated superior segmentation performance compared to state-of-the-art models, achieving a mean Dice score of 0.82 (95% CI, 0.79 - 0.84) versus 0.73 for nnU-Net (p < 0.0001, Student's T test). Additionally, TTAS exhibited a four-fold lower mean Absolute Volume Difference (AbVD) of 6.49 mL (95% CI, 4.80 - 8.20) compared to nnU-Net's AbVD of 23.16 mL (p < 0.0001). Conclusion: The developed TTAS framework offered superior PE segmentation, aiding accurate volume determination from CT scans.


Small Language Models for Curriculum-based Guidance

Katharakis, Konstantinos, Rossi, Sippo, Mukkamala, Raghava Rao

arXiv.org Artificial Intelligence

The adoption of generative AI and large language models (LLMs) in education is still emerging. In this study, we explore the development and evaluation of AI teaching assistants that provide curriculum-based guidance using a retrieval-augmented generation (RAG) pipeline applied to selected open-source small language models (SLMs). We benchmarked eight SLMs, including LLaMA 3.1, IBM Granite 3.3, and Gemma 3 (7-17B parameters), against GPT-4o. Our findings show that with proper prompting and targeted retrieval, SLMs can match LLMs in delivering accurate, pedagogically aligned responses. Importantly, SLMs offer significant sustainability benefits due to their lower computational and energy requirements, enabling real-time use on consumer-grade hardware without depending on cloud infrastructure. This makes them not only cost-effective and privacy-preserving but also environmentally responsible, positioning them as viable AI teaching assistants for educational institutions aiming to scale personalized learning in a sustainable and energy-efficient manner.


Bringing Pedagogy into Focus: Evaluating Virtual Teaching Assistants' Question-Answering in Asynchronous Learning Environments

Siyan, Li, Xu, Zhen, Raghuram, Vethavikashini Chithrra, Zhang, Xuanming, Yu, Renzhe, Yu, Zhou

arXiv.org Artificial Intelligence

Asynchronous learning environments (ALEs) are widely adopted for formal and informal learning, but timely and personalized support is often limited. In this context, Virtual Teaching Assistants (VTAs) can potentially reduce the workload of instructors, but rigorous and pedagogically sound evaluation is essential. Existing assessments often rely on surface-level metrics and lack sufficient grounding in educational theories, making it difficult to meaningfully compare the pedagogical effectiveness of different VTA systems. To bridge this gap, we propose an evaluation framework rooted in learning sciences and tailored to asynchronous forum discussions, a common VTA deployment context in ALE. We construct classifiers using expert annotations of VTA responses on a diverse set of forum posts. We evaluate the effectiveness of our classifiers, identifying approaches that improve accuracy as well as challenges that hinder generalization. Our work establishes a foundation for theory-driven evaluation of VTA systems, paving the way for more pedagogically effective AI in education.


A Humanoid Social Robot as a Teaching Assistant in the Classroom

Sievers, Thomas

arXiv.org Artificial Intelligence

Although innovation and the support of new technologies are much needed to ease the burden on the education system, social robots in schools to help teachers with educational tasks are rare. Child-Robot Interaction (CRI) could support teachers and add an embodied social component to modern multi-modal and multi-sensory learning environments already in use. The social robot Pepper, connected to the Large Language Model (LLM) ChatGPT, was used in a high school classroom to teach new learning content to groups of students. I tested the technical possibilities with the robot on site and asked the students about their acceptance and perceived usefulness of teaching with the help of a social robot. All participants felt that the robot's presentation of the learning material was appropriate or at least partially appropriate and that its use made sense.


A Large-Scale Real-World Evaluation of LLM-Based Virtual Teaching Assistant

Kweon, Sunjun, Nam, Sooyohn, Lim, Hyunseung, Hong, Hwajung, Choi, Edward

arXiv.org Artificial Intelligence

Virtual Teaching Assistants (VTAs) powered by Large Language Models (LLMs) have the potential to enhance student learning by providing instant feedback and facilitating multi-turn interactions. However, empirical studies on their effectiveness and acceptance in real-world classrooms are limited, leaving their practical impact uncertain. In this study, we develop an LLM-based VTA and deploy it in an introductory AI programming course with 477 graduate students. To assess how student perceptions of the VTA's performance evolve over time, we conduct three rounds of comprehensive surveys at different stages of the course. Additionally, we analyze 3,869 student--VTA interaction pairs to identify common question types and engagement patterns. We then compare these interactions with traditional student--human instructor interactions to evaluate the VTA's role in the learning process. Through a large-scale empirical study and interaction analysis, we assess the feasibility of deploying VTAs in real-world classrooms and identify key challenges for broader adoption. Finally, we release the source code of our VTA system, fostering future advancements in AI-driven education: \texttt{https://github.com/sean0042/VTA}.


AI Mentors for Student Projects: Spotting Early Issues in Computer Science Proposals

Aher, Gati, Schmucker, Robin, Mitchell, Tom, Lipton, Zachary C.

arXiv.org Artificial Intelligence

When executed well, project-based learning (PBL) engages students' intrinsic motivation, encourages students to learn far beyond a course's limited curriculum, and prepares students to think critically and maturely about the skills and tools at their disposal. However, educators experience mixed results when using PBL in their classrooms: some students thrive with minimal guidance and others flounder. Early evaluation of project proposals could help educators determine which students need more support, yet evaluating project proposals and student aptitude is time-consuming and difficult to scale. In this work, we design, implement, and conduct an initial user study ( n = 36) for a software system that collects project proposals and aptitude information to support educators in determining whether a student is ready to engage with PBL. We find that (1) users perceived the system as helpful for writing project proposals and identifying tools and technologies to learn more about, (2) educator ratings indicate that users with less technical experience in the project topic tend to write lower-quality project proposals, and (3) GPT-4o's ratings show agreement with educator ratings. While the prospect of using LLMs to rate the quality of students' project proposals is promising, its long-term effectiveness strongly hinges on future efforts at characterizing indicators that reliably predict students' success and motivation to learn.


PyEvalAI: AI-assisted evaluation of Jupyter Notebooks for immediate personalized feedback

Wandel, Nils, Stotko, David, Schier, Alexander, Klein, Reinhard

arXiv.org Artificial Intelligence

Grading student assignments in STEM courses is a laborious and repetitive task for tutors, often requiring a week to assess an entire class. For students, this delay of feedback prevents iterating on incorrect solutions, hampers learning, and increases stress when exercise scores determine admission to the final exam. Recent advances in AI-assisted education, such as automated grading and tutoring systems, aim to address these challenges by providing immediate feedback and reducing grading workload. However, existing solutions often fall short due to privacy concerns, reliance on proprietary closed-source models, lack of support for combining Markdown, LaTeX and Python code, or excluding course tutors from the grading process. To overcome these limitations, we introduce PyEvalAI, an AI-assisted evaluation system, which automatically scores Jupyter notebooks using a combination of unit tests and a locally hosted language model to preserve privacy. Our approach is free, open-source, and ensures tutors maintain full control over the grading process. A case study demonstrates its effectiveness in improving feedback speed and grading efficiency for exercises in a university-level course on numerics.


Prompt-Based Cost-Effective Evaluation and Operation of ChatGPT as a Computer Programming Teaching Assistant

Ballestero-Ribó, Marc, Ortiz-Martínez, Daniel

arXiv.org Artificial Intelligence

The dream of achieving a student-teacher ratio of 1:1 is closer than ever thanks to the emergence of large language models (LLMs). One potential application of these models in the educational field would be to provide feedback to students in university introductory programming courses, so that a student struggling to solve a basic implementation problem could seek help from an LLM available 24/7. This article focuses on studying three aspects related to such an application. First, the performance of two well-known models, GPT-3.5T and GPT-4T, in providing feedback to students is evaluated. The empirical results showed that GPT-4T performs much better than GPT-3.5T, however, it is not yet ready for use in a real-world scenario. This is due to the possibility of generating incorrect information that potential users may not always be able to detect. Second, the article proposes a carefully designed prompt using in-context learning techniques that allows automating important parts of the evaluation process, as well as providing a lower bound for the fraction of feedbacks containing incorrect information, saving time and effort. This was possible because the resulting feedback has a programmatically analyzable structure that incorporates diagnostic information about the LLM's performance in solving the requested task. Third, the article also suggests a possible strategy for implementing a practical learning tool based on LLMs, which is rooted on the proposed prompting techniques. This strategy opens up a whole range of interesting possibilities from a pedagogical perspective.


Large Language Models in Computer Science Education: A Systematic Literature Review

Raihan, Nishat, Siddiq, Mohammed Latif, Santos, Joanna C. S., Zampieri, Marcos

arXiv.org Artificial Intelligence

Large language models (LLMs) are becoming increasingly better at a wide range of Natural Language Processing tasks (NLP), such as text generation and understanding. Recently, these models have extended their capabilities to coding tasks, bridging the gap between natural languages (NL) and programming languages (PL). Foundational models such as the Generative Pre-trained Transformer (GPT) and LLaMA series have set strong baseline performances in various NL and PL tasks. Additionally, several models have been fine-tuned specifically for code generation, showing significant improvements in code-related applications. Both foundational and fine-tuned models are increasingly used in education, helping students write, debug, and understand code. We present a comprehensive systematic literature review to examine the impact of LLMs in computer science and computer engineering education. We analyze their effectiveness in enhancing the learning experience, supporting personalized education, and aiding educators in curriculum development. We address five research questions to uncover insights into how LLMs contribute to educational outcomes, identify challenges, and suggest directions for future research.